Postgres: Benchmark UUIDv4 vs UUIDv7 Primary Keys 🔑
Table of Contents
tl;dr code can be found here: mikeblum/pg-uuidv7-benchmark
Much has been written about the issues with using UUIDs as primary keys in Postgres and other RDBMS.
Rather than re-hash comparing v4 vs v7 I wanted to investigate how to practically validate if my db indexes were setup correctly and what the impact might be to disk / RAM / planner times.
Check out Buildkite’s excellent primer on why UUIDv7 (or time-bound globally unique identifiers) make for more efficient lookups:
Employing the latest 🐘 Postgres 15 - I used generate_series
to create benchmarks between the two primary keys.
Using naive default BTREE
indexes we see something quite surprising! There doesn’t appear to be a material difference between UUIDv4 and UUIDv7? 😖 We want to use p90
to factor out outliers aka MIN()
and MAX()
and get a more representative sample-set.
Generating UUIDv7 IDs #
As of now, the default uuid-ossp
Postgres extension doesn’t support UUIDv7
. But we can use application-level libraries to generate the ID with gofrs/uuid
. I prefer google/uuid
in general but it lacks UUIDv7 support (and the API is a bit terse). One could also write a function or stored procedure to off-load ID generation to the DB teir but I like being able to validate the logic all app-side.
UUIDv4 / UUIDv7: inserts vs lookups #
SELECT
AVG(insert_duration_ns) AS
insert_duration_ns_avg,
percentile_cont(0.90)
WITHIN GROUP (ORDER BY insert_duration_ns ASC) AS
insert_duration_ns_p90,
AVG(lookup_duration_ns) AS lookup_duration_ns_avg,
percentile_cont(0.90)
WITHIN GROUP (ORDER BY lookup_duration_ns ASC) AS lookup_duration_ns_p90
FROM
app.uuid_result
WHERE
version = 4;
SELECT
AVG(insert_duration_ns) AS
insert_duration_ns_avg,
percentile_cont(0.90)
WITHIN GROUP (ORDER BY insert_duration_ns ASC) AS
insert_duration_ns_p90,
AVG(lookup_duration_ns) AS lookup_duration_ns_avg,
percentile_cont(0.90)
WITHIN GROUP (ORDER BY lookup_duration_ns ASC) AS lookup_duration_ns_p90
FROM
app.uuid_result
WHERE
version = 7;
note: shorter / faster durations are better 📊
Looking at the generate_series
data set I originally bound it to 1 minute granularity over the last day:
-- casting resolves computation requirement
-- https://github.com/sqlc-dev/sqlc/issues/1995
SELECT ts::timestamp FROM generate_series(
- date_trunc('day', now()::timestamp) - INTERVAL '1 day',
+ date_trunc('day', now()::timestamp) - INTERVAL '1 month',
now()::timestamp,
INTERVAL '1 minute'
) AS ts;
Increasing the amount of data (over month-to-date) showed a slight improvement in lookup times using a UUIDv4
BTREE
index vs a UUIDv7
BRIN
index. Where things really differ is in the size of the index:
SELECT pg_size_pretty(pg_relation_size('app.uuid_v7_pkey'));
1664 kB
SELECT pg_size_pretty(pg_relation_size('app.idx_uuid_v7_id'));
24 kB
As we add more data the BTREE
index roughly grows with the size of the table:
SELECT pg_size_pretty(pg_total_relation_size('app.uuid_v4'));
7656 kB
SELECT pg_size_pretty(pg_total_relation_size('app.uuid_v7'));
5960 kB
showing a nearly ~+25% footprint simply based on the choice of primary key.
One might ask - if BRIN
indexes are so much more space-efficient why do we need both the PRIMARY KEY
BTREE
index in addition to the BRIN
index? Aren’t we losing out on the space efficiencies? Since BRIN
indexes don’t support uniqueness we still need an index that supports the primary key (PK) use case: guaranteed UNIQUE
and NOT NULL
.
Does it go to 11? #
🤔 What might be going on? One would think a Postgres BRIN index would be ideal as BRIN plays well with temporally bound data.
Given that there are two indexes I found a StackOverflow post pointing to using VACCUM
to force the planner to re-assess which index to use:
VACUUM — garbage-collect and optionally analyze a database
Specifically INDEX_CLEANUP
is set to AUTO
. We can see the effect by running VACUUM VERBOSE ANALYZE;
before doing the lookups:
VACUUM ( analyze true, index_cleanup true, verbose true ) app.uuid_v7;
but the VACUUM
doesn’t appear to do much:
INFO: vacuuming "postgres.app.uuid_v7"
INFO: launched 2 parallel vacuum workers for index cleanup (planned: 2)
INFO: finished vacuuming "postgres.app.uuid_v7": index scans: 0
pages: 0 removed, 368 remain, 1 scanned (0.27% of total)
tuples: 0 removed, 44142 remain, 0 are dead but not yet removable
removable cutoff: 2372128, which was 0 XIDs old when operation ended
index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
index "idx_uuid_v7_id_brin": pages: 3 in total, 0 newly deleted, 0 currently deleted, 0 reusable
avg read rate: 0.000 MB/s, avg write rate: 1.804 MB/s
buffer usage: 22 hits, 0 misses, 4 dirtied
WAL usage: 1 records, 1 full page images, 6057 bytes
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.01 s
INFO: analyzing "app.uuid_v7"
INFO: "uuid_v7": scanned 368 of 368 pages, containing 44142 live rows and 0 dead rows; 30000 rows in sample, 44142 estimated total rows
Query 1 OK: VACUUM
and we see the BTREE
index being used instead of the BRIN
one:
EXPLAIN ANALYZE SELECT * FROM app.uuid_v7 WHERE id = '0015022e-1787-4e26-993f-67694c2cafc6'::uuid LIMIT 1;
or
EXPLAIN ANALYZE SELECT * FROM app.uuid_v7 WHERE id::string = '0015022e-1787-4e26-993f-67694c2cafc6' LIMIT 1;
have no affect:
Limit (cost=0.29..8.31 rows=1 width=24) (actual time=0.036..0.037 rows=0 loops=1)
-> Index Scan using uuid_v7_pkey (BTREE index) on uuid_v7 (cost=0.29..8.31 rows=1 width=24) (actual time=0.035..0.035 rows=0 loops=1)
" Index Cond: (id = '0015022e-1787-4e26-993f-67694c2cafc6'::uuid)"
Planning Time: 0.224 ms
Execution Time: 0.068 ms
So if we can’t convince the planner to use the BRIN
index we can compare and contrast BTREE
vs BRIN
in separate columns:
EXPLAIN ANALYZE SELECT * FROM app.uuid_v7 WHERE id = '018c2c7b-1de6-73a0-842b-ec66f74c8993' LIMIT 1;
Limit (cost=0.29..8.31 rows=1 width=40) (actual time=0.127..0.128 rows=1 loops=1)
-> Index Scan using idx_uuid_v7_id (BTREE index) on uuid_v7 (cost=0.29..8.31 rows=1 width=40) (actual time=0.126..0.126 rows=1 loops=1)
" Index Cond: (id = '018c2c7b-1de6-73a0-842b-ec66f74c8993'::uuid)"
Planning Time: 0.461 ms
Execution Time: 0.178 ms
vs
EXPLAIN ANALYZE SELECT * FROM app.uuid_v7 WHERE id_brin = '018c2c7b-1de6-73a0-842b-ec66f74c8993' LIMIT 1;
Limit (cost=12.03..564.27 rows=1 width=40) (actual time=0.138..0.138 rows=1 loops=1)
-> Bitmap Heap Scan on uuid_v7 (cost=12.03..564.27 rows=1 width=40) (actual time=0.137..0.137 rows=1 loops=1)
" Recheck Cond: (id_brin = '018c2c7b-1de6-73a0-842b-ec66f74c8993'::uuid)"
Rows Removed by Index Recheck: 65
Heap Blocks: lossy=1
-> Bitmap Index Scan on idx_uuid_v7_id_brin (cost=0.00..12.03 rows=14739 width=0) (actual time=0.086..0.086 rows=2400 loops=1)
" Index Cond: (id_brin = '018c2c7b-1de6-73a0-842b-ec66f74c8993'::uuid)"
Planning Time: 0.357 ms
Execution Time: 0.191 ms
Interestingly, explicitly casting the param as ::uuid
doesn’t appear to have any impact:
EXPLAIN ANALYZE SELECT * FROM app.uuid_v7 WHERE id_brin = '018c2c7b-1de6-73a0-842b-ec66f74c8993'::uuid LIMIT 1;
Limit (cost=12.03..564.27 rows=1 width=40) (actual time=0.036..0.036 rows=1 loops=1)
-> Bitmap Heap Scan on uuid_v7 (cost=12.03..564.27 rows=1 width=40) (actual time=0.035..0.035 rows=1 loops=1)
" Recheck Cond: (id_brin = '018c2c7b-1de6-73a0-842b-ec66f74c8993'::uuid)"
Rows Removed by Index Recheck: 65
Heap Blocks: lossy=1
-> Bitmap Index Scan on idx_uuid_v7_id_brin (cost=0.00..12.03 rows=14739 width=0) (actual time=0.019..0.019 rows=2400 loops=1)
" Index Cond: (id_brin = '018c2c7b-1de6-73a0-842b-ec66f74c8993'::uuid)"
Planning Time: 0.116 ms
Execution Time: 0.059 ms
With the new schema to denote BTREE
vs BRIN
indexes we have the following:
SELECT
AVG(insert_duration_ns) AS
insert_duration_ns_avg,
percentile_cont(0.90)
WITHIN GROUP (ORDER BY insert_duration_ns ASC) AS
insert_duration_ns_p90,
AVG(lookup_duration_ns) AS lookup_duration_ns_avg,
percentile_cont(0.90)
WITHIN GROUP (ORDER BY lookup_duration_ns ASC) AS lookup_duration_ns_p90
FROM
app.uuid_result
WHERE
version = 7 AND id_idx = 'BRIN';
the V4 version remains unchanged given UUIDv4
s are random by design and a BTREE
would be optimal.
So even though BRIN
indexes use orders of magnitude less memory than BTREE indexes; there doesn’t seem to be as clear benefits for lookup queries. This seems to align with Cybertec’s take: CYBERTEC: btree vs. BRIN: 2 options for indexing in PostgreSQL data warehouses. An interesting takeaway is on the time to create a BTREE
vs BRIN
index if there is already a lot of data.
TODO?? #
some TODOs for next time:
- How many rows until
BRIN
beatsBTREE
for primary key (PK)? - Is there a way to force the Postgres planner to use
BRIN
overBTREE
?
In closing this has been an interesting deep dive into UUIDv7
. I think using UUIDv7 makes sense to use as the primary identifier for my Postgres projects going forward. Where there is more R&D to be done is what is the optimal index to use based on size / query time / and locality.